R Package Required:

# Import packages
library(tidyverse)
library(GGally)
library(gridExtra)
library(corrgram)
library(plotly)
library(knitr)
library(ggridges)

Introduction

“If there is magic on this planet, it is contained in water.” — Loren Eiseley, anthropologist, educator, philosopher, and natural science writer.

In this assignment, we attempt to unfold the magic of water in wetlands, an important ecosystem that supports numerous ecofunctions such as biodiversity. We will explore the relationships between several physiochemical properties of water:

Is there a strong correlation between temperature and other water properties? What are some general trends of the properties from year to year? Can pH level be used to generalize the properties of a water body?

Utilizing Alberta Biodiversity Monitoring Institute’s Wetland Habitat Water Physiochemistry dataset, we carry out a series of data wrangling and exploratory data analysis to examine and visualize relationships among variables.

Understanding the relationships of these water properties is a stepping stone that will aid further evaluation and protection of healthy wetlands.


Data Selection and Preparation

About the Data

The Raw Water Physiochemistry dataset (ABMI dataset) was requested and downloaded from Alberta Biodiversity Monitoring Institute (ABMI). The data were recorded by ABMI’s field crew to document the water physiochemistry measurements in pre-determined wetland sites across Alberta from 2007 to 2019. Data were collected in compliance with ABMI’s Wetland Field Data Collection Protocols (ABMI 2019), attributes in the dataset are summarized in Table 1 below.

Table 1 - Attribute description of Wetland W04 Water Physiochemistry Raw Data

attribute <- read_csv('ABMI_attribute.csv',show_col_types = FALSE)
select(attribute, c(1:3))

Data Preparation and Understanding the Data

# Import data
abmi <- read_csv('A_W04_Water_Physiochemistry_3970548287569838812.csv')

Table 2 - The first 6 rows of the ABMI dataset

head(abmi)

There are a total of 5244 records of 17 variables.

Data preparation

Our next steps are to remove redundant fields that are irrelevant, including:

  • Field Date - sample dates were already standardized to take place between June 15 and July 31 (ABMI 2019);

  • Time - sample times were standardized to take place between 1-2pm (ABMI 2019);

  • Turbidity - the value was not recorded until 2017 (ABMI 2019), which is more than half of the records;

According to ABMI 2019, the survey records were taken from the middle of the water column if the water is <2 m deep, and at 1 m depth if the water is >2 m deep, and the nutrient sample were mixed from all 3 locations, i.e. same recording for all 3 locations within the same wetland site. Therefore the Location does not contribute much to the variation of recorded value and it can be omitted from the dataset. A better approach would be to consolidate the mean values and compare among wetland sites instead of the 3 locations within each wetland site (this is completed in the next section).

In addition, column names are abbreviated in preparation for easier reference in later steps.

# subset data
abmi_tidy <- abmi[,-c(4,6,7,17)]

# rename column names
abmi_tidy <- abmi_tidy %>% 
  rename("Field_Crew" = "Field Crew Member(s)",
         "Depth" = "Depth (metres)",
         "Temp" = "Temperature (Degrees Celsius)",
         "DO" = "Dissolved Oxygen (milligrams/Litre)",
         "Cond" = "Conductivity (mSiemens/cm)",
         "Salinity" = "Salinity (parts per thousand)",
         "Total_N" = "Total Nitrogen (micrograms/Litre)", 
         "Total_P" = "Total Phosphorous (micrograms/Litre)",
         "DOC" = "Dissolved Organic Carbon (milligrams/Litre)")

# show structure of the dataset
str(abmi_tidy)
## tibble [5,244 x 13] (S3: tbl_df/tbl/data.frame)
##  $ Rotation  : chr [1:5244] "Rotation 2" "Rotation 2" "Rotation 2" "Rotation 2" ...
##  $ ABMI Site : chr [1:5244] "329" "329" "329" "330" ...
##  $ Year      : num [1:5244] 2015 2015 2015 2015 2015 ...
##  $ Field_Crew: chr [1:5244] "SLO" "SLO" "SLO" "JAD2" ...
##  $ Depth     : chr [1:5244] "0.4" "0.3" "0.3" "2.6" ...
##  $ Temp      : chr [1:5244] "22.31" "22.63" "22.5" "19.23" ...
##  $ pH        : chr [1:5244] "10.19" "10.11" "10.13" "8.28" ...
##  $ DO        : chr [1:5244] "13.77" "15.42" "16.25" "7.62" ...
##  $ Cond      : chr [1:5244] "0.169" "0.164" "0.171" "0.2" ...
##  $ Salinity  : chr [1:5244] "0.08" "0.08" "0.07" "0.1" ...
##  $ Total_N   : chr [1:5244] "2370" "2370" "2370" "608" ...
##  $ Total_P   : chr [1:5244] "242" "242" "242" "17" ...
##  $ DOC       : chr [1:5244] "27.8" "27.8" "27.8" "12.4" ...

Note the data types for the physiochemical properties of water are shown as characters. They need to be converted to numeric data types.

abmi_tidy <- abmi_tidy %>% 
  mutate_at(c(5:13), as.numeric)

Group the data by site records and convert the dataset to a tibble

As mentioned earlier, our next step is to group the observations by wetland site per visit (i.e. condense 3 records to 1 if the site was visited once; if the site was visited twice in both rotations, then there should be 2 separate records, 1 for each visit), using the mean value of physiochemical properties for each location within the wetland site.

abmi_site <- summarise(
  group_by(abmi_tidy, Rotation, `ABMI Site`, `Field_Crew`),
  Year = mean(Year),
  Depth_m = mean(Depth),
  Temp_m = mean(Temp),
  pH_m = mean(pH),
  DO_m = mean(DO),
  Cond_m = mean(Cond),
  Salinity_m = mean(Salinity),
  Total_N_m = mean(Total_N),
  Total_P_m = mean(Total_P),
  DOC_m = mean(DOC)
)

abmi_site <- as_tibble(abmi_site)

Table 3 - Data summary of ABMI dataset after grouping the data by site visit

knitr::kable(summary(abmi_site))
Rotation ABMI Site Field_Crew Year Depth_m Temp_m pH_m DO_m Cond_m Salinity_m Total_N_m Total_P_m DOC_m
Length:1748 Length:1748 Length:1748 Min. :2007 Min. : 0.200 Min. : 5.063 Min. : 3.507 Min. : 0.1033 Min. : 0.00167 Min. : 0.0000 Min. : 51 Min. : 3.0 Min. : 0.90
Class :character Class :character Class :character 1st Qu.:2011 1st Qu.: 1.167 1st Qu.:17.996 1st Qu.: 7.257 1st Qu.: 6.2467 1st Qu.: 0.11683 1st Qu.: 0.0600 1st Qu.: 989 1st Qu.: 32.0 1st Qu.: 18.52
Mode :character Mode :character Mode :character Median :2014 Median : 1.667 Median :19.738 Median : 8.150 Median : 7.8200 Median : 0.26167 Median : 0.1267 Median : 1570 Median : 65.0 Median : 28.35
NA NA NA Mean :2014 Mean : 2.445 Mean :19.580 Mean : 8.130 Mean : 7.7554 Mean : 0.99101 Mean : 0.5963 Mean : 2730 Mean : 331.3 Mean : 36.49
NA NA NA 3rd Qu.:2017 3rd Qu.: 2.633 3rd Qu.:21.422 3rd Qu.: 8.997 3rd Qu.: 9.3333 3rd Qu.: 0.70517 3rd Qu.: 0.3700 3rd Qu.: 2655 3rd Qu.: 229.5 3rd Qu.: 40.50
NA NA NA Max. :2019 Max. :41.167 Max. :29.530 Max. :12.100 Max. :22.4500 Max. :52.13333 Max. :33.9267 Max. :761600 Max. :25600.0 Max. :596.60
NA NA NA NA NA’s :10 NA’s :8 NA’s :11 NA’s :15 NA’s :177 NA’s :36 NA’s :5 NA’s :5 NA’s :6

Note 177 null values in Conductivity, that’s 10% of the total records. Upon examination, we see that most null values were in 2012 survey, so we can either eliminate all records with a null Conductivity value, or eliminate just the Conductivity column. Since the values are valid in most of other columns and it is in our interest to explore the relationships of general water properties, it is best to forego the Conductivity column in this case to maintain data integrity.

Other records with random null values are then omitted from further analysis. Rotation, Year, and Field_Crew variables are converted from character data types to factors.

# Remove missing values
abmi_new <- abmi_site[,-9]
abmi_new <- na.omit(abmi_new)

# Convert corresponding variables to factors
abmi_new <- abmi_new %>%
  mutate_at(c(1,3:4), as.factor)

Note that there are 99 combinations for Field Crew Members, as ABMI recruits new crews every year to collect data. This factor may not reveal any relationships to the whole dataset, however it may be helpful to evaluate any relationships within a given year where the crew is more consistent and evaluate protocol compliance by the crews. Hence we will leave the column in for now.


Data Summary

1695 records with 12 variables, including 1 character variable for ABMI Site ID, 3 categorical variables for Rotation(survey periods), Field_Crew, Year and 8 numeric variables for physiochemical properties of sampled water.

Table 4 - Variables and data types in ABMI dataset

data.frame(Variable = names(abmi_new),
           Type = sapply(abmi_new, typeof),
           row.names = NULL) %>% kable()
Variable Type
Rotation integer
ABMI Site character
Field_Crew integer
Year integer
Depth_m double
Temp_m double
pH_m double
DO_m double
Salinity_m double
Total_N_m double
Total_P_m double
DOC_m double

Table 5 - Data summary of ABMI dataset after clean-up

knitr::kable(summary(abmi_new))
Rotation ABMI Site Field_Crew Year Depth_m Temp_m pH_m DO_m Salinity_m Total_N_m Total_P_m DOC_m
Rotation 1:1271 Length:1695 JSW : 54 2013 :199 Min. : 0.200 Min. : 5.063 Min. : 3.507 Min. : 0.1033 Min. : 0.0000 Min. : 51.0 Min. : 3.0 Min. : 0.90
Rotation 2: 424 Class :character RHO2 : 52 2014 :169 1st Qu.: 1.167 1st Qu.:17.972 1st Qu.: 7.265 1st Qu.: 6.2517 1st Qu.: 0.0600 1st Qu.: 991.5 1st Qu.: 32.0 1st Qu.: 18.50
NA Mode :character CMC : 50 2011 :165 Median : 1.667 Median :19.727 Median : 8.163 Median : 7.8267 Median : 0.1233 Median : 1578.0 Median : 65.0 Median : 28.30
NA NA JZS : 42 2017 :165 Mean : 2.440 Mean :19.569 Mean : 8.140 Mean : 7.7741 Mean : 0.5971 Mean : 2751.1 Mean : 333.6 Mean : 36.35
NA NA JBU : 41 2012 :153 3rd Qu.: 2.633 3rd Qu.:21.418 3rd Qu.: 9.008 3rd Qu.: 9.3550 3rd Qu.: 0.3700 3rd Qu.: 2669.0 3rd Qu.: 235.5 3rd Qu.: 40.35
NA NA KBE : 41 2015 :151 Max. :41.167 Max. :29.530 Max. :12.100 Max. :22.4500 Max. :33.9267 Max. :761600.0 Max. :25600.0 Max. :596.60
NA NA (Other):1415 (Other):693 NA NA NA NA NA NA NA NA

Exploration of Variation

Categorical Variables

Figure 1 - Stacked bar chart by Year and Field Crew

Year_plot <- ggplot(abmi_new, aes(x=Year, fill=Field_Crew))+
  geom_bar(position="stack")+
  ylab ("Count by Field Crew")+
  theme(legend.position = "none")
ggplotly(Year_plot)

This interactive bar chart shows the number of records obtained by each field crew for each year. It is evident that the number of recordings are not consistent from year to year, with lower number in earlier years (2007-2010). Various factors including funding, workforce size, extreme weather can affect the number of recordings. This should be noted when performing further analysis and examining year-based patterns.


Numerical Variables

Figure 2 - Combination boxplots for Temperature and pH values

Temp_box <- ggplot(abmi_new, aes(x=Temp_m))+
  geom_boxplot()+
  xlab("Temperature °C")+
  theme(legend.position = "none",
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

pH_box <- ggplot(abmi_new, aes(x=pH_m))+
  geom_boxplot()+
  xlab("pH")+
  theme(legend.position = "none",
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

grid.arrange(Temp_box, pH_box, nrow=1)

The Temperature boxplot shows quite a wide range of temperature recordings, from around 5°C to almost 30°C. Majority of the recordings sit between 18°C and 22°C.

The pH boxplot shows that all recordings were between 3 and 12, which is within the pH range of 0-14. In a nutshell, most water appear to be slightly basic with pH level over 7.

Figure 3 - Combination density plots for depth, dissolved oxygen, and salinity

Depth_plot <- ggplot(abmi_new,aes(x=Depth_m))+
  geom_density(fill='#69b3a2', color='#e9ecef')+
  xlab("Wetland Depth (m)")+
  ylab("Density")

DO_plot <- ggplot(abmi_new,aes(x=DO_m))+
  geom_density(fill='coral1', color='coral2', alpha=0.8)+
    xlab("Dissolved Oxygen(mg/L)")+
    ylab("Density")

Salinity_plot <- ggplot(abmi_new, aes(x=Salinity_m))+
  geom_density(fill='cornflowerblue', color='cornflowerblue', alpha=0.8)+
    xlab("Salinity (ppt)")+
    ylab("Density")

grid.arrange(Depth_plot, DO_plot, Salinity_plot, nrow=1)

The density plots shown above provide a visualized distribution of the three variables. Most wetlands are under 10m deep and salinity less than 5 parts per thousand. There is a wider range of dissolved oxygen values.

Figure 4A - Combination histogram for nutrient content

N_plot <- ggplot(abmi_new, aes(x=Total_N_m))+
  geom_histogram(bins = 50, col = "white", fill = "turquoise")+
  xlab("Total Nitrogen (µg/L)")+
  ylab("Count")

P_plot <- ggplot(abmi_new, aes(x=Total_P_m))+
  geom_histogram(bins = 50, col = "white", fill = "purple")+
  xlab("Total Phosphorous (µg/L)")+
  ylab("Count")

DOC_plot <- ggplot(abmi_new,aes(x=DOC_m))+
  geom_histogram(bins = 50, col = "white", fill = "coral")+
  xlab("Dissolved Organic Carbon (mg/L)")+
  ylab("Count")

grid.arrange(N_plot, P_plot, DOC_plot, nrow=1)

From the histogram plots, note the positively skewed distribution occurs in all 3 nutrient properties. The values for Total Nitrogen are extremely skewed with the maximum value to be over 10,000 times than the minimum value. Upon more read-up on the wetland survey methods, we found out the applicable range for Total Nitrogen is 0.05-10.0 mg/L (EPA Method 353.2).

Similar instances occur in the Total Phosphorous values, where applicable concentration range for the method used is 0.01 to 6 mg/L (Standard Method 4500-P Phosphorous).

Any values recorded outside the range specified in the Methods are subject to errors and may affect further data analysis. Therefore, we will remove these values, and re-evaluate their distribution. The range for dissolved organic carbon has not been validated so we will not filter out any DOC values.

Figure 4B - Combination histogram for Total Nitrogen and Total Phosphorous values post filter

# Filter out invalid data
abmi_filter <- abmi_new %>% 
  filter(Total_P_m <= 6000) %>% 
  filter(Total_N_m <= 10000)

N_plot <- ggplot(abmi_filter, aes(x=Total_N_m))+
  geom_histogram(bins = 50, col = "white", fill = "turquoise")+
  xlab("Total Nitrogen (µg/L)")+
  ylab("Count")

P_plot <- ggplot(abmi_filter, aes(x=Total_P_m))+
  geom_histogram(bins = 50, col = "white", fill = "purple")+
  xlab("Total Phosphorous (µg/L)")+
  ylab("Count")

grid.arrange(N_plot, P_plot, nrow=1)


Exploration of Co-variation

Correlogram

Figure 5 - Correlogram of all numeric variables

corrgram(abmi_filter, lower.panel=panel.shade, upper.panel = panel.pts)

The correlogram is a great starting point to investigate any relationships between the water physiochemical properties. In the graph shown above, blue boxes in the lower graphs indicate positive correlation while the red boxes indicate negative correlation. The darker the blue or red, the higher level of correlation. In this dataset, the most positive correlation exists between Total_N and Total_P, as well as between Total_N and DOC, while the most negative correlation lies between Depth and Total_N.

We can then investigate the relationship between Total_N and Total_P with more details.

Figure 6 - Heatmap of total Nitrogen and total Phosphorous content

ggplotly(abmi_filter %>% 
  ggplot(aes(x=Total_N_m, Total_P_m))+
  geom_hex(alpha=0.8)+
  scale_fill_gradient(low = "cadetblue", high = "darkslategrey")+
  geom_smooth(method = 'lm', color = "cyan")+
  geom_rug(col="steelblue", alpha=0.1)+
  xlab("Total Nitrogen (µg/L)")+
  ylab("Total Phosphorous (µg/L)"))

Note the majority of records are distributed in the lower left corner of the graph, with a seemingly positive correlation between the two variables.


Grouped Scatter Plot Matrix

Figure 7 - Scatter plot matrix of all numeric variables grouped by pH

abmi_filter %>% 
  select(c(5:12)) %>%
  mutate(pH = ifelse(pH_m < 7, "acidic", 
                     (ifelse(pH_m >= 7 & pH_m <9, "slightly basic", "basic")))) %>% 
  ggpairs(mapping = ggplot2::aes(color = pH, alpha = 0.2),
          lower = list(continuous = wrap("points", alpha = 0.2, size = 0.1), combo=wrap("dot_no_facet", alpha=0.3, size = 0.1)),
          upper = NULL)+
  theme(axis.line = element_blank(), axis.text = element_blank(), 
        axis.ticks = element_blank())

Although the correlations between pH values and other water properties were not obvious, we can further categorize pH and examine if any patterns exist between levels of acidity. Since there weren’t many pH records of less than 7, all records of less than 7 were categorized as acidic, those equal to or greater than 7 but less than 9 as slightly basic, and those greater than and equal to 9 as basic.

It is interesting to note the distribution of Total_N in relation to acidity, where acidic water tend to have lower Nitrogen value.


Yearly Variation

Figure 8 - Temperature variation across year

ggplotly(abmi_filter %>% 
  ggplot(aes(x=Year, y = Temp_m))+
  geom_boxplot(fill="slateblue", alpha=0.2)+
  stat_summary(fun=mean, geom="point", shape=15, size=3,
               color='darkcyan')+
  ylab("Temperature°C"))

Figure 9 - Total nitrogen variation across year

ggplotly(abmi_filter %>% 
  ggplot(aes(x=Year, y = Total_N_m))+
  geom_boxplot(fill="slateblue", alpha=0.2)+
  stat_summary(fun=mean, geom="point", shape=15, size=3,
               color='coral')+
  ylab("Total Nitrogen (µg/L)"))

Based on the results from Figure 8 and 9, the temperature and total nitrogen varies from year to year and there is no visible linear trend from 2007 to 2019. It will be interesting to see if there is any relationship to average air temperature in July. However it should be noted that the survey sites span across the whole Province of Alberta, with large temperature variation among sites.


Data Collection Standardization

Figure 9 - Ridgeline plot of dissolved oxygen variation across field crews

abmi_filter %>% 
  filter(Year == 2016) %>% 
  select(Field_Crew, DO_m) %>% 
  ggplot(aes(x=DO_m, y=Field_Crew, fill=Field_Crew))+
  geom_density_ridges(stat="density_ridges")+
  theme_ridges()+
  theme(legend.position = "none")+
  theme(panel.spacing = unit(0.1, "lines"),
        strip.text.x = element_text(size = 8))+
  xlab("Dissolved Oxygen (mg/L)")+
  ylab("Field Crew")

We can examine the distribution of variable values grouped by each field crew to determine potential operational discrepancies among field crews. Here we selected dissolved oxygen variable collected in 2016. According to the ridgeline plot, there is no field crew that significantly contribute to data skewing, although it seems that crew “KLE” tend to record lower DO values when compared with other field crews in 2016.


Discussion

Exploratory data analysis of the ABMI Wetland Physiochemistry dataset allows us to better understand the data through visualization, and to explore patterns and relationships within and between variables. By plotting the variables, it was easy to spot the outliers and identify needs to further investigate the data. Plotting the variables also provide a better sense of data distribution as well as preliminary correlations between variables that helps us understand the data.

Most numeric variables are distributed positively skewed, in part due to a pre-conditioned lower limit of 0 while with no restriction on the maximum value. It is worth investigating such pattern and determine if there is any potential for inaccurate data input.

The strongest relationships exist between total nitrogen and 3 variables - total phosphorous, dissolved organic carbon, and depth. In contrary to anticipation, there is no strong correlation between temperature and other water properties, and no significant indication that pH can be used to generalize physiochemical properties of a water body. We examined temperature and total nitrogen variations across 2007 to 2019 and did not find any visible trends.

Based on the exploratory data analysis results, it seems that the magic of water is far more complex than what our variables in the dataset can explain. However, total nitrogen variable proved to be the most sensitive to changes from other variables, and its relationship to other organic nutrient characteristics such as total phosphorous values and dissolved organic carbon should be examined further.


Acknowledgement & Reference

Raw water physiochemistry data (2007-2019 inclusive) from the Alberta Biodiversity Monitoring Institute was used, in whole or part, to complete this assignment. More information on the Institute can be found at: http://www.abmi.ca

Reference